According to CDC, heart disease is the leading cause of death in the United States. 47% of the people in the United States have at least one of three key risk factors for heart disease: high blood pressure, high blood cholesterol, and smoking. Other key factors like diabetic status, BMI, not getting enough physical activity or drinking too much alcohol can also contribute to heart disease. Thus, detecting and preventing risk factors that have the greatest impact on heart disease is important to healthcare. In this project, the main factors that are being analyzed are BMI, sleep, physical health, diabetic status (yes or no), alcohol consumption (yes or no), smoking, physical activity. These factors are being analyzed to see the relationship between itself and the outcome (have heart disease or not) and determine whether these variables are statistically significant associated with heart disease. Multivariable regression analysis was also performed to evaluate the relationship between the heart disease and the combination of all the factors.
Data and indicators:
The data was from 2020 with 319795 data points and 18 variables. The dataset was obtained from Kaggle (https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease/data)
The heart disease health indicators based on the BRFSS 2015 Codebook (https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf) BMI: Body Mass Index Smoking: Have you smoked at least 100 cigarettes in your entire life? Physical activity: Adults who reported doing physical activity/exercise during the past 30 days other than their regular jobs Diabetic status: Were you ever told you have diabetes? Physical health: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.25.0 (2022-06-12 02:20:02 UTC) successfully loaded. See ?R.oo for help.
Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':
throw
The following objects are masked from 'package:methods':
getClasses, getMethods
The following objects are masked from 'package:base':
attach, detach, load, save
R.utils v2.12.2 (2022-11-11 22:00:03 UTC) successfully loaded. See ?R.utils for help.
Attaching package: 'R.utils'
The following object is masked from 'package:tidyr':
extract
The following object is masked from 'package:utils':
timestamp
The following objects are masked from 'package:base':
cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
Rows: 319795 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, ...
dbl (4): BMI, PhysicalHealth, MentalHealth, SleepTime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data inspection includes: check dimensions using dim(heart), check the first few rows (headers) using head(heart), check the last few rows (footers) using tail(heart), and check variable names and types using str(heart).
Rename variables to shorter length and all lower case
#Categorize bmi into groups (category is based on CDC standard)heart$bmi_group <-ifelse(heart$bmi <18.5, "underweight",ifelse(heart$bmi >=18.5& heart$bmi <25, "healthy",ifelse(heart$bmi >=25& heart$bmi <30, "overweight",ifelse(heart$bmi >=30, "obese","not obese"))))table(heart$bmi_group)
#Change yes, no into binary 1,0library(dplyr)library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
heart <- heart %>%mutate(heartdis_bi =if_else(heartdis =="Yes", 1, 0))#Heart disease outcome by bmi groups in plot (data visualisation)heart <- heart %>%mutate(outcome_heart =factor(heartdis_bi))plot_bmi<- heart %>%ggplot()+geom_bar(mapping =aes(x = bmi_group, fill = outcome_heart))+labs(title ="Heart disease outcome by bmi groups")ggplotly(plot_bmi)
The plot shows that most people in the study are either obese or overweight. Majority of the people in the study who have heart disease are in obese and overweight group, while the least people are in the underweight group.
#Is BMI a significant risk factor for heart disease?#Regression model for bmioverallbmi <-glm (outcome_heart ~ bmi, data = heart, family = binomial)summary(overallbmi)
Call:
glm(formula = outcome_heart ~ bmi, family = binomial, data = heart)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.9874047 0.0273202 -109.3 <2e-16 ***
bmi 0.0234955 0.0009071 25.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 183054 on 301716 degrees of freedom
Residual deviance: 182416 on 301715 degrees of freedom
AIC: 182420
Number of Fisher Scoring iterations: 5
In this logistic regression model, the dependent variable is whether an individual has heart disease (0 for no heart disease, 1 for having heart disease) and the independent variable is BMI. For a one-unit increase in BMI, the estimated log-odds of having heart disease increase by 0.0234955 and the p-value being extremely close to zero (p< 2 x 10^-16) suggest that there is an evidence that BMI is associated with the likelihood of having heart disease.
Breaking BMI into categories makes it easier to interpret the model in terms of risk associated with different BMI categories.
#Regression model for bmi groupsreg_bmi<-glm (outcome_heart ~ bmi_group, data = heart, family = binomial)summary(reg_bmi)
Call:
glm(formula = outcome_heart ~ bmi_group, family = binomial, data = heart)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.59859 0.01307 -198.83 < 2e-16 ***
bmi_groupobese 0.46838 0.01661 28.20 < 2e-16 ***
bmi_groupoverweight 0.33323 0.01680 19.84 < 2e-16 ***
bmi_groupunderweight 0.14108 0.05365 2.63 0.00855 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 183054 on 301716 degrees of freedom
Residual deviance: 182204 on 301713 degrees of freedom
AIC: 182212
Number of Fisher Scoring iterations: 5
Coefficients for BMI groups: looking at the effect of different BMI categories on the likelihood of a person having a heart disease compared to healthy group.
The response variable being 1 indicates a person having a heart disease, 0 indicates a person doesn’t have a heart disease. For the obese group, the coefficient of 0.46838 indicates that, compared to individuals in healthy group, individuals in obese group have about 0.46838 increase in log-odds of having heart disease. Same logic applies to individuals who are overweight (coefficient: 0.33323) and underweight (0.14108). All p-values associated with these coefficients are <0.05, which suggest that the effect of BMI categories on the likelihood of having heart disease is statistically significant. The coefficients show that there is a positive association between higher BMI categories and the likelihood of having heart disease.
library(mfp)
Loading required package: survival
Attaching package: 'survival'
The following object is masked _by_ '.GlobalEnv':
heart
#heart disease and sleep time relationshipheart$sleep_group <-ifelse(heart$sleep <7, "lack of sleep",ifelse(heart$sleep >=7, "enough sleep", "no sleep"))model_sleep<-glm(outcome_heart ~ sleep_group, family = binomial,data = heart)summary(model_sleep)
Call:
glm(formula = outcome_heart ~ sleep_group, family = binomial,
data = heart)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.344716 0.007781 -301.343 < 2e-16 ***
sleep_grouplack of sleep 0.109211 0.013468 8.109 5.11e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 183054 on 301716 degrees of freedom
Residual deviance: 182989 on 301715 degrees of freedom
AIC: 182993
Number of Fisher Scoring iterations: 5
#stacked histogram for heart disease vs sleep hoursleep1 <-ggplot(heart, aes(x=sleep, color=outcome_heart, fill=outcome_heart)) +scale_fill_manual(values =c("skyblue", "pink")) +scale_colour_manual(values =c("black", "black")) +geom_histogram()ggplotly(sleep1)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The results of the regression analysis for heart disease with sleep groups (<7 hours of sleep are indicated as “lack of sleep,” >=7 hours of sleep are indicated as “enough sleep”) suggest that the individuals in “lack of sleep” group have an estimated increase of 0.109211 in the log-odds of having heart disease compared to those with enough sleep, assuming all other factors are held constant. The increase is statistically significant as the p-value is 5.11 x 10^-16, which is less than .05. The stacked histogram shows that the majority of the individuals who have heart disease have an average sleep of 7.93 hours. Also based on the visualisation through the boxplot, it seems like there is no obvious difference between sleep hours and heart disease as the distributions for heart disease and no heart disease are relatively similar. Therefore, exploring the correlation to understand the relationship can provide more insights.
# Calculate the correlation between heart disease and sleep hourslibrary(polycor)cor_sleep <-cor(heart$sleep, as.numeric(heart$outcome_heart))print(cor_sleep)
[1] 0.01083366
model1<-glm(outcome_heart ~ sleep, family = binomial,data = heart)summary(model1)
Call:
glm(formula = outcome_heart ~ sleep, family = binomial, data = heart)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.490325 0.031157 -79.929 < 2e-16 ***
sleep 0.025466 0.004278 5.953 2.64e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 183054 on 301716 degrees of freedom
Residual deviance: 183019 on 301715 degrees of freedom
AIC: 183023
Number of Fisher Scoring iterations: 5
The correlation coefficient is very close to 0 (0.0108336), but the p-value is statistically significant (p<.0001). It implies that there is a significant relationship between the variables sleep hours and heart disease, but the strength of the relationship is weak. This situation can suggest that the sample size is large (the dataset contains 319795 data points), could consider other variables involved, and there might be a non-linear relationship as linear relationship is what the Pearson correlation coefficient measures.
The categorical variables below (physical health, diabetic status, alcohol status, smoking status, physical activity) were each ran through logistic regression model to see the association between the variable and the outcome variable (heart disease).
#heart disease vs physical healthmfp::mfp(outcome_heart ~fp(phys_H), family = binomial, data = heart)
Call:
mfp::mfp(formula = outcome_heart ~ fp(phys_H), data = heart,
family = binomial)
Deviance table:
Resid. Dev
Null model 183053.8
Linear model 176719.9
Final model 176404.5
Fractional polynomials:
df.initial select alpha df.final power1 power2
phys_H 4 1 0.05 4 -2 0
Transformations of covariates:
formula
phys_H I(((phys_H+1)/10)^-2)+log(((phys_H+1)/10))
Coefficients:
Intercept phys_H.1 phys_H.2
-1.830579 0.005437 0.574534
Degrees of Freedom: 301716 Total (i.e. Null); 301714 Residual
Null Deviance: 183100
Residual Deviance: 176400 AIC: 176400
glm(outcome_heart ~ phys_H, family = binomial, data = heart) %>%summary()
Call:
glm(formula = outcome_heart ~ phys_H, family = binomial, data = heart)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.5646860 0.0075732 -338.65 <2e-16 ***
phys_H 0.0495475 0.0005772 85.84 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 183054 on 301716 degrees of freedom
Residual deviance: 176720 on 301715 degrees of freedom
AIC: 176724
Number of Fisher Scoring iterations: 5
The result shows the association between heart disease vs physical health function is statistically significant (p<2x10^-16).
#heart disease vs diabetic statusheart <- heart %>%mutate(diabetic_status =if_else(diabetic =="Yes", 1, 0))mfp::mfp(outcome_heart ~fp(diabetic_status), family = binomial, data = heart)
Warning: glm.fit: algorithm did not converge
Call:
mfp::mfp(formula = outcome_heart ~ fp(diabetic_status), data = heart,
family = binomial)
Deviance table:
Resid. Dev
Null model 183053.8
Linear model 175460.4
Final model 175460.4
Fractional polynomials:
df.initial select alpha df.final power1 power2
diabetic_status 4 1 0.05 1 1 .
Transformations of covariates:
formula
diabetic_status I((diabetic_status+1)^1)
Coefficients:
Intercept diabetic_status.1
-3.906 1.322
Degrees of Freedom: 301716 Total (i.e. Null); 301715 Residual
Null Deviance: 183100
Residual Deviance: 175500 AIC: 175500
glm(outcome_heart ~ diabetic_status, family = binomial, data = heart) %>%summary()
Call:
glm(formula = outcome_heart ~ diabetic_status, family = binomial,
data = heart)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.584689 0.007663 -337.29 <2e-16 ***
diabetic_status 1.321665 0.014216 92.97 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 183054 on 301716 degrees of freedom
Residual deviance: 175460 on 301715 degrees of freedom
AIC: 175464
Number of Fisher Scoring iterations: 5
The result shows the association between heart disease vs diabetic status function is statistically significant (p<2x10^-16).
#heart disease vs alcohol statusheart <- heart %>%mutate(alc_status =if_else(alc =="Yes", 1, 0))mfp::mfp(outcome_heart ~fp(alc_status), family = binomial, data = heart)
Call:
mfp::mfp(formula = outcome_heart ~ fp(alc_status), data = heart,
family = binomial)
Deviance table:
Resid. Dev
Null model 183053.8
Linear model 182597.8
Final model 182597.8
Fractional polynomials:
df.initial select alpha df.final power1 power2
alc_status 4 1 0.05 1 1 .
Transformations of covariates:
formula
alc_status I((alc_status+1)^1)
Coefficients:
Intercept alc_status.1
-1.6638 -0.6109
Degrees of Freedom: 301716 Total (i.e. Null); 301715 Residual
Null Deviance: 183100
Residual Deviance: 182600 AIC: 182600
glm(outcome_heart ~ alc_status, family = binomial, data = heart) %>%summary()
Call:
glm(formula = outcome_heart ~ alc_status, family = binomial,
data = heart)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.274696 0.006498 -350.07 <2e-16 ***
alc_status -0.610893 0.031105 -19.64 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 183054 on 301716 degrees of freedom
Residual deviance: 182598 on 301715 degrees of freedom
AIC: 182602
Number of Fisher Scoring iterations: 5
The association between heart disease vs alcohol status function is statistically significant (p<2x10^-16).
#heart disease vs smoking statusheart <- heart %>%mutate(smoke =if_else(smoking =="Yes", 1, 0))mfp::mfp(outcome_heart ~fp(smoke), family = binomial, data = heart)
Call:
mfp::mfp(formula = outcome_heart ~ fp(smoke), data = heart, family = binomial)
Deviance table:
Resid. Dev
Null model 183053.8
Linear model 179804.8
Final model 179804.8
Fractional polynomials:
df.initial select alpha df.final power1 power2
smoke 4 1 0.05 1 1 .
Transformations of covariates:
formula
smoke I((smoke+1)^1)
Coefficients:
Intercept smoke.1
-3.3988 0.7283
Degrees of Freedom: 301716 Total (i.e. Null); 301715 Residual
Null Deviance: 183100
Residual Deviance: 179800 AIC: 179800
glm(outcome_heart ~ smoke, family = binomial, data = heart) %>%summary()
Call:
glm(formula = outcome_heart ~ smoke, family = binomial, data = heart)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.670536 0.009734 -274.35 <2e-16 ***
smoke 0.728308 0.012896 56.47 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 183054 on 301716 degrees of freedom
Residual deviance: 179805 on 301715 degrees of freedom
AIC: 179809
Number of Fisher Scoring iterations: 5
The association between heart disease vs smoking function is statistically significant (p<2x10^-16).
#heart disease vs physical activityheart <- heart %>%mutate(phys =if_else(phys_A =="Yes", 1, 0))mfp::mfp(outcome_heart ~fp(phys), family = binomial, data = heart)
Warning: glm.fit: algorithm did not converge
Call:
mfp::mfp(formula = outcome_heart ~ fp(phys), data = heart, family = binomial)
Deviance table:
Resid. Dev
Null model 183053.8
Linear model 180633.7
Final model 180633.7
Fractional polynomials:
df.initial select alpha df.final power1 power2
phys 4 1 0.05 1 1 .
Transformations of covariates:
formula
phys I((phys+1)^1)
Coefficients:
Intercept phys.1
-1.1480 -0.6791
Degrees of Freedom: 301716 Total (i.e. Null); 301715 Residual
Null Deviance: 183100
Residual Deviance: 180600 AIC: 180600
glm(outcome_heart ~ phys, family = binomial, data = heart) %>%summary()
Call:
glm(formula = outcome_heart ~ phys, family = binomial, data = heart)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.82707 0.01084 -168.57 <2e-16 ***
phys -0.67912 0.01341 -50.66 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 183054 on 301716 degrees of freedom
Residual deviance: 180634 on 301715 degrees of freedom
AIC: 180638
Number of Fisher Scoring iterations: 5
The association between heart disease vs physical activity function is statistically significant (p<2x10^-16). All categorical variables show that the association with heart disease is statistically significant. It is also worth to see the overall relationship with heart disease when all the categorical variables and continuous variables (bmi, sleep) are considered together.
Call:
glm(formula = outcome_heart ~ bmi + sleep + phys_H + diabetic_status +
alc_status + smoke + phys, family = binomial, data = heart)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.0351224 0.0448687 -67.645 <2e-16 ***
bmi -0.0010043 0.0009957 -1.009 0.313
sleep 0.0381331 0.0039993 9.535 <2e-16 ***
phys_H 0.0363439 0.0006275 57.920 <2e-16 ***
diabetic_status 1.0928719 0.0152434 71.695 <2e-16 ***
alc_status -0.5640942 0.0318038 -17.737 <2e-16 ***
smoke 0.6212631 0.0133844 46.417 <2e-16 ***
phys -0.2858621 0.0146181 -19.555 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 183054 on 301716 degrees of freedom
Residual deviance: 168069 on 301709 degrees of freedom
AIC: 168085
Number of Fisher Scoring iterations: 5
All the variables besides BMI show that they are statistically significant associated with heart disease. However, when doing the heart disease vs BMI alone (bivariate relationship), it showed statistically significant association between bmi and heart disease. There could be multicollinearity where the predictors are correlated with each other making it difficult to isolate their individual effects. Thus, it is worth seeing if each BMI category has a different effect on the log-odds of the outcome compared to the reference category (in this case, the healthy bmi group). Putting BMI into groups can sometimes help to capture non-linear relationships or multicollinearity to better represent the underlying structure of the data.
Call:
glm(formula = outcome_heart ~ bmi_group + sleep + phys_H + diabetic_status +
alc_status + smoke + phys, family = binomial, data = heart)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.1801330 0.0354484 -89.712 < 2e-16 ***
bmi_groupobese 0.1042983 0.0176469 5.910 3.41e-09 ***
bmi_groupoverweight 0.2130307 0.0173049 12.310 < 2e-16 ***
bmi_groupunderweight -0.0213053 0.0551619 -0.386 0.699
sleep 0.0388455 0.0040016 9.707 < 2e-16 ***
phys_H 0.0365086 0.0006276 58.174 < 2e-16 ***
diabetic_status 1.0753132 0.0152059 70.717 < 2e-16 ***
alc_status -0.5609814 0.0318131 -17.634 < 2e-16 ***
smoke 0.6194444 0.0133861 46.275 < 2e-16 ***
phys -0.2840472 0.0146049 -19.449 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 183054 on 301716 degrees of freedom
Residual deviance: 167907 on 301707 degrees of freedom
AIC: 167927
Number of Fisher Scoring iterations: 5
For individuals categorized as underweight compared to the reference category (bmi healthy group), the estimated log-odds of having heart disease decrease by -0.0213053 but it is not statistically significant (p=0.699). All estimates have significant p-values (p<2x10^-16, and p-value for bmi_groupobese is 3.41 x 10^-9) indicating evidence of association between these predictors and the likelihood of having heart disease.
Conclusion: According to CDC and aside from the key risk factors like high blood pressure, high blood cholesterol, and smoking, several other medical conditions and lifestyle choices can put people at a higher risk for heart disease which include but not limited to diabetes, physical inactivity, and overweight and obesity. Based on the analysis performed, diabetes, physical inactivity, high bmi, smoking, sleep hours, alcohol consumption, and overall physical health status are have statistically significant association to heart disease.